Uncovering the Hidden Structures in Press Releases of Canadian Federal Parties Issued During the 2015 Election using word2vec and t-SNE

The 2015 Canadian federal election is in its final stretch, and at Datacratic we thought it would be a great opportunity to collect and play with some data.

Each party has a different view on what is important to move Canada forward. This has the effect of centering their official message around certain themes. The question we asked is: can we find patterns in the words used in each party’s press releases? Do some deal more with certain themes than others?

The answer is Yes.

This image is a static scatter plot of press releases. Each dot represents a press release, colored with its party’s official color, and press releases are positioned such that similar ones are close together. The axes are unitless.

The labels (and pink dots) represent major themes that we hand-selected and then projected into the 2D space computed by our algorithms, to help interpret the space.

The rest of this notebook is the exact code that was used to generate the scatter plot with MLDB, along with an interactive version of the plot at the bottom. The algorithms used to produce this image include word2vec and t-SNE.

Creating the plot


In [458]:
import json, pandas, requests
from datetime import datetime
import numpy
from pymldb import Connection
mldb = Connection()

Importing the data

We wrote a small scraper using the scrapy Python library. We scraped the press releases of the NDP, the Liberals, the Green Party and the Conservatives. The resulting data was saved in a CSV file with one press release per line.
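
The spider looked roughly like this (a minimal sketch: the URLs and CSS selectors here are hypothetical, and each party's site needed its own):

import scrapy

class PressReleaseSpider(scrapy.Spider):
    name = "press_releases"
    start_urls = ["http://www.example-party.ca/news"]  # hypothetical listing page

    def parse(self, response):
        # follow each press release link found on the listing page
        for href in response.css("a.press-release::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_release)

    def parse_release(self, response):
        # one item per press release, matching the CSV columns used below
        yield {
            "party": "example",
            "date": response.css("span.date::text").extract_first(),
            "title": response.css("h1::text").extract_first(),
            "link": response.url,
            "full_text": " ".join(response.css("div.content ::text").extract()),
        }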

As mentioned above, we added a few lines representing the major themes we wish to project into our space.


In [717]:
ds = mldb.v1.datasets("raw")
ds.put({
    "type": "text.csv.tabular",
    "params": {
        "dataFileUrl":"file:///mldb_data/all_parties_clean.csv",
        "ignoreBadLines": True,
        "rowNamePrefix": "pr"
    }
})


Out[717]:
PUT http://localhost/v1/datasets/raw
201 Created
{
  "status": {
    "num_skipped_lines": 150
  }, 
  "config": {
    "params": {
      "ignoreBadLines": true, 
      "rowNamePrefix": "pr", 
      "dataFileUrl": "file:///mldb_data/all_parties_clean.csv"
    }, 
    "type": "text.csv.tabular", 
    "id": "raw"
  }, 
  "state": "ok", 
  "type": "text.csv.tabular", 
  "id": "raw"
}

In [718]:
mldb.query("select party, count(*) from raw where to_timestamp(date) > to_timestamp('2015-08-01') group by party")


Out[718]:
count(*) party
_rowName
["category"] 17 category
["conservative"] 104 conservative
["green"] 90 green
["liberal"] 264 liberal
["ndp"] 162 ndp

In [719]:
mldb.query("select * from raw limit 1")


Out[719]:
date full_text link party title
_rowName
pr000000000763 2015-06-11T00:00:00 Merci Céline Membres du Conseil d'administrat... http://www.ndp.ca/news/toms-speech-to-board-tr... ndp Tom's speech to the Board of Trade of Metropol...

Creating bags of words

The next step is to take the full_text column and tokenize its contents. We do this because we will need to compare press releases using their content.

The tokenize function splits the column into individual words, removes words shorter than 3 characters, and applies a stoplist.
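
For intuition, here is a rough pure-Python equivalent of what the tokenize call below does (the STOPLIST here is a hypothetical, abbreviated stand-in; MLDB does all of this natively):

import re
from collections import Counter

STOPLIST = set(["the", "and", "for", "that", "with"])  # abbreviated stand-in

def baggify(full_text):
    # split on the same characters as the tokenize() call below
    words = re.split(r'[ ?!;/\[\]*"]+', full_text.lower())
    # drop short words and stopwords, then count occurrences
    return Counter(w for w in words if len(w) >= 3 and w not in STOPLIST)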


In [720]:
bag = mldb.v1.procedures("baggify")
bag.put({
    "type": "transform",
    "params": {
        "inputDataset": "raw",
        "outputDataset": {
            "id": "bag_of_words",
            "type": "beh.mutable"
        },
        "select": """tokenize(full_text, {' ?!;/[]*"' as splitchars, '' as quotechar}) as *""",
        "where": "full_text IS NOT NULL AND title IS NOT NULL AND to_timestamp(date) > to_timestamp('2015-08-01')"
    }
})
mldb.v1.datasets("bag_of_words").delete()
bag.runs.post({})


Out[720]:
POST http://localhost/v1/procedures/baggify/runs
201 Created
{
  "runStarted": "2015-10-16T03:01:44.905Z", 
  "state": "finished", 
  "runFinished": "2015-10-16T03:01:46.383Z", 
  "id": "2015-10-16T03:01:44.901528Z-5bc7042b732cb41f"
}

The result is a clean list of words that will be easy to compare across press releases.


In [721]:
mldb.query("select * from bag_of_words limit 1")


Out[721]:
accessible achieve afford announced balanced based belief budgets canada canadas ... tim today transfer transfers true universally values weaken work wreck
_rowName
pr000000000085 1 1 1 2 1 1 1 1 1 1 ... 1 1 1 2 1 1 1 1 1 1

1 rows × 106 columns

What are the most common words?


In [722]:
df = mldb.query("select sum({*}) as * from bag_of_words")
df2 = df.T
df2.columns = ["count"]

In [723]:
df2.sort(columns="count", ascending=False)[:25]


Out[723]:
count
canadians 1010
canada 918
plan 834
harper 754
mulcair 731
government 730
liberal 688
party 653
canadian 605
trudeau 602
ndp 550
stephen 520
minister 508
economy 498
prime 479
justin 465
green 447
today 390
leader 382
families 372
change 356
conservative 355
canadas 346
jobs 341
middle 305

Use word2vec to embed our bag of words

The word2vec tool, introduced by Mikolov et al., is used to embed words into a high-dimensional space.

Word2vec gives us word embeddings, but since what interests us is press releases, and each one is represented by a bag of words, we actually need document embeddings. This means that for a given press release, we need to embed each of its words and then combine them all to get a representation of the press release in the word2vec space. To achieve this, we simply average each dimension across the press release's word embeddings to get the coordinates for the press release.
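
In numpy terms, the idea is simply this (toy 3-dimensional vectors stand in for the real 300-dimensional GoogleNews embeddings):

import numpy as np

w2v = {  # toy word vectors standing in for the GoogleNews embeddings
    "economy": np.array([0.2, -0.1, 0.4]),
    "jobs":    np.array([0.1, -0.3, 0.5]),
    "plan":    np.array([0.0,  0.2, 0.1]),
}
bag = ["economy", "jobs", "plan"]  # the words of one press release

# the press release's coordinates: the mean over its word vectors
doc_embedding = np.mean([w2v[w] for w in bag], axis=0)
# -> array([ 0.1, -0.06666667, 0.33333333])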

Load the trained word2vec data


In [724]:
mldb.v1.datasets("w2v").delete()
w2v = mldb.v1.procedures("w2vimport")
w2v.put({
    "type": 'import.word2vec',
    "params": {
        "dataFileUrl": 'file:///mldb_data/GoogleNews-vectors-negative300.bin',
        "output": {
            "type": 'embedding',
            "id": 'w2v'
        }
    }
})
w2v.runs.post({})


Out[724]:
POST http://localhost/v1/procedures/w2vimport/runs
201 Created
{
  "runStarted": "2015-10-16T03:01:48.765Z", 
  "state": "finished", 
  "runFinished": "2015-10-16T03:02:57.330Z", 
  "id": "2015-10-16T03:01:48.764986Z-5bc7042b732cb41f"
}

In [725]:
mldb.query("select * from w2v limit 5")


Out[725]:
000000 000001 000002 000003 000004 000005 000006 000007 000008 000009 ... 000290 000291 000292 000293 000294 000295 000296 000297 000298 000299
_rowName
0 0.152344 -0.121094 0.102051 -0.083984 -0.184570 0.015320 0.238281 -0.478516 0.072754 0.218750 ... 0.028320 -0.164062 -0.173828 0.361328 -0.201172 -0.142578 -0.021606 0.013794 -0.057373 0.277344
Pragmatix 0.007721 0.123047 0.028076 -0.157227 -0.034424 -0.053955 0.037598 -0.071777 -0.149414 -0.072754 ... -0.100586 -0.160156 -0.042969 -0.013062 0.075684 -0.002808 0.014893 -0.011414 0.184570 -0.047852
griefing 0.091797 -0.113281 0.029907 0.048584 -0.241211 0.195312 0.034912 0.302734 -0.034424 -0.082520 ... 0.104980 0.233398 -0.118652 -0.089355 -0.142578 0.062988 0.061768 0.054199 -0.225586 0.187500
Richard_Altomare 0.056396 0.455078 -0.189453 -0.055420 0.096191 -0.296875 0.195312 0.240234 0.224609 -0.104004 ... -0.139648 -0.141602 0.003479 0.159180 0.281250 0.010376 -0.174805 0.060547 -0.025269 -0.174805
quantum_physicist 0.227539 -0.076172 0.248047 0.112305 0.004913 0.146484 0.025391 -0.168945 -0.001282 0.034180 ... 0.167969 -0.087402 -0.292969 0.045898 -0.058594 -0.206055 -0.172852 -0.164062 0.094238 0.099609

5 rows × 300 columns

Prepare the datasets

We now have the w2v dataset that contains embeddings for words. We also have the bag_of_words dataset that contains the words present in each press release. To compute the embeddings for the press releases, we need to set up a merged dataset.


In [726]:
ds = mldb.v1.datasets("tranposed_bag_of_words")
ds.put({
    "type": "transposed",
    "params": {
        "dataset": {"id": "bag_of_words"}
    }
})


Out[726]:
PUT http://localhost/v1/datasets/tranposed_bag_of_words
201 Created
{
  "config": {
    "params": {
      "dataset": {
        "id": "bag_of_words"
      }
    }, 
    "type": "transposed", 
    "id": "tranposed_bag_of_words"
  }, 
  "state": "ok", 
  "type": "transposed", 
  "id": "tranposed_bag_of_words"
}

In [727]:
proc = mldb.v1.datasets("bow_merged")
proc.put({
    "type": "merged",
    "params": {
        "datasets": [{"id": "tranposed_bag_of_words"}, {"id": "w2v"}]
    }
})


Out[727]:
PUT http://localhost/v1/datasets/bow_merged
201 Created
{
  "status": [
    null, 
    null
  ], 
  "config": {
    "params": {
      "datasets": [
        {
          "id": "tranposed_bag_of_words"
        }, 
        {
          "id": "w2v"
        }
      ]
    }, 
    "type": "merged", 
    "id": "bow_merged"
  }, 
  "state": "ok", 
  "type": "merged", 
  "id": "bow_merged"
}

In [776]:
rows = requests.get("http://localhost/v1/datasets/bag_of_words/rows").json()
print rows[:5]


[u'pr000000000085', u'pr000000000062', u'pr000000001003', u'pr000000000069', u'pr000000000055']

The bow_merged dataset now contains both the columns representing the embedding of each word and the pr* columns, representing how many times the given word appeared in the given press release.


In [778]:
mldb.query("select * from bow_merged limit 1")


Out[778]:
000000 000001 000002 000003 000004 000005 000006 000007 000008 000009 ... 000290 000291 000292 000293 000294 000295 000296 000297 000298 000299
_rowName
0 0.152344 -0.121094 0.102051 -0.083984 -0.18457 0.01532 0.238281 -0.478516 0.072754 0.21875 ... 0.02832 -0.164062 -0.173828 0.361328 -0.201172 -0.142578 -0.021606 0.013794 -0.057373 0.277344

1 rows × 300 columns

The following query computes the average of the embeddings for the release represented by rows[0]. The >0 condition translates to: "for the words present in the release".


In [777]:
mldb.query("select avg({* EXCLUDING(pr*)}) as * from bow_merged where "+rows[0]+">0")


Out[777]:
000000 000001 000002 000003 000004 000005 000006 000007 000008 000009 ... 000290 000291 000292 000293 000294 000295 000296 000297 000298 000299
_rowName
[] -0.011037 0.041938 0.002756 0.117649 -0.052991 -0.013769 0.042946 -0.062109 0.107114 0.02232 ... -0.061832 0.04426 -0.03532 0.039322 -0.069882 -0.002956 -0.024179 -0.022233 0.051886 0.007178

1 rows × 300 columns

Compute embeddings for all press releases

Using what we saw above, we can now calculate each press release's embedding with the following code. To get embeddings for all releases, we make the call $n$ times, once per press release.


In [729]:
import urllib
import multiprocessing.dummy
def calc_embedding(pr):
    return (pr, mldb.query("select avg({* EXCLUDING(pr*)}) as * from bow_merged where "+pr+">0"))

p = multiprocessing.dummy.Pool(25)
new_res = p.map(calc_embedding, rows)

We can now save the results in a new dataset.


In [771]:
col_names = requests.get("http://localhost/v1/datasets/w2v/columns").json()

ds = mldb.v1.datasets("pr_embed")
ds.put({
    "type":"embedding",
})

for pr, line in new_res:
    # skip press releases for which the embedding query returned nothing
    if len(line) == 0: continue
    row = []
    # build one [column_name, value, timestamp] tuple per dimension
    for col_name, cell in zip(col_names, line):
        row.append([col_name, float(line[cell]), 0])
    try:
        ds.rows.post({
            "rowName": pr,
            "columns": row
        })
    except:
        print pr, len(line)
        break


ds.commit.post({})


Out[771]:
POST http://localhost/v1/datasets/pr_embed/commit
200 OK

In [743]:
mldb.query("select * from pr_embed where rowName() = '"+rows[0]+"'")


Out[743]:
000000 000001 000002 000003 000004 000005 000006 000007 000008 000009 ... 000290 000291 000292 000293 000294 000295 000296 000297 000298 000299
_rowName
pr000000000085 -0.011037 0.041938 0.002756 0.117649 -0.052991 -0.013769 0.042946 -0.062109 0.107114 0.02232 ... -0.061832 0.04426 -0.03532 0.039322 -0.069882 -0.002956 -0.024179 -0.022233 0.051886 0.007178

1 rows × 300 columns

Use t-SNE for dimensionality reduction

t-SNE is a very powerful technique that can be used to reduce the dimensionality of data to 2D or 3D for visualisation. We use it to determine how to lay out the press releases in the scatter plot.
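
For readers without MLDB, the same reduction can be sketched with scikit-learn, assuming X is the (n_releases, 300) numpy matrix of document embeddings from the previous step:

from sklearn.manifold import TSNE

# perplexity matches the value passed to MLDB's tsne.train below
xy = TSNE(n_components=2, perplexity=5).fit_transform(X)
# xy[:, 0] and xy[:, 1] are the x and y scatter plot coordinates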


In [762]:
tsne = mldb.v1.procedures("pr_embed_tsne")
tsne.put({
    "type" : "tsne.train",
    "params" : {
        "dataset" : {"id" : "pr_embed"},
        "output" : {"id" : "pr_embed_tsne", "type" : "embedding"},
        "select" : "*",
        "where" : "true",
        "modelFileUrl": "file:///mldb_data/tsne.bin",
        "functionName": "tsne_embed",
        "perplexity": 5
    }
})

mldb.v1.datasets("pr_embed_tsne").delete()

tsne.runs.post({})


Out[762]:
POST http://localhost/v1/procedures/pr_embed_tsne/runs
201 Created
{
  "runStarted": "2015-10-16T03:48:41.192Z", 
  "state": "finished", 
  "runFinished": "2015-10-16T03:48:43.122Z", 
  "id": "2015-10-16T03:48:41.192231Z-5bc7042b732cb41f"
}

In [763]:
mldb.v1.datasets("pr_embed_tsne_merged").put({
    "type" : "merged",
    "params" : {
        "datasets": [
            {"id": "raw"},
            {"id": "pr_embed_tsne"}
        ]
    }
})


Out[763]:
PUT http://localhost/v1/datasets/pr_embed_tsne_merged
201 Created
{
  "status": [
    {
      "num_skipped_lines": 150
    }, 
    null
  ], 
  "config": {
    "params": {
      "datasets": [
        {
          "id": "raw"
        }, 
        {
          "id": "pr_embed_tsne"
        }
      ]
    }, 
    "type": "merged", 
    "id": "pr_embed_tsne_merged"
  }, 
  "state": "ok", 
  "type": "merged", 
  "id": "pr_embed_tsne_merged"
}

Drawing the scatter plot

We now have the pr_embed_tsne_merged dataset that represents all of our press releases with $(x,y)$ coordinates in the embedding space.


In [764]:
mldb.query("""
select party, title, x, y 
from pr_embed_tsne_merged 
where to_timestamp(date) > to_timestamp('2015-08-01') 
limit 5""")


Out[764]:
party title x y
_rowName
pr000000000085 conservative Harper Commits To New Funding To Continue Terr... 17.528767 61.783394
pr000000000062 conservative STATEMENT: Harper Extends Best Wishes On The 7... -41.850677 -54.289703
pr000000001003 liberal Fact Check: Desperate NDP makes false claims a... -58.890091 31.784502
pr000000000069 conservative Stephen Harper announces plan to save Canadian... -21.604961 58.552570
pr000000000055 conservative PM Harper Announces Plan To Strengthen The Nat... -14.184937 -59.870262

In [765]:
df = mldb.query("""
select party, title, x, y
from pr_embed_tsne_merged
where to_timestamp(date) > to_timestamp('2015-08-01')
""")

In [766]:
import numpy as np
colormap = {
    "ndp": "#FF8000",
    "liberal": "#DF0101",
    "conservative": "#0000FF",
    "green": "#01DF01",
    "category": "#FE2EC8"
}

import bokeh.plotting as bp
from bokeh.models import HoverTool

In [767]:
#this line must be in its own cell 
bp.output_notebook()


BokehJS successfully loaded.

In [773]:
press_releases = np.array([str(x.encode('ascii','ignore').split("|")[0]) for x in list(df.title.values)])
x = bp.figure(plot_width=900, plot_height=700, title="Press Releases of Canadian Federal Parties During 2015 Elections",
       tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
       x_axis_type=None, y_axis_type=None, min_border=1)
x.scatter(
    x = df.x.values, 
    y = df.y.values, 
    color=[colormap[k] for k in df.party.values],
    radius=1,
    fill_alpha=0.5,
    source=bp.ColumnDataSource({"title": press_releases})
).select(dict(type=HoverTool)).tooltips = {"title":"@title"}
bp.show(x)


This plot is interactive. Feel free to explore it by hovering over the different press releases. The tooltip shows the title of each press release.

